LINEA is an open-source R library aimed at simplifying and accelerating the development of linear models to understand the relationship between two or more variables.
Linear models are commonly used in a variety of contexts including natural and social sciences, and various business applications (e.g. marketing, finance).
This page covers how to set up the linea library
to analyse a time series.
The library can be installed from CRAN using
install.packages('linea') or from GitHub using
devtools::install_github('paladinic/linea'). Once installed
you can verify the installation by printing the package version.
print(packageVersion("linea"))
## [1] '0.0.2'
The linea library works well with pipes. Used with dplyr
and plotly, it can perform data analysis and visualization with elegant
code. Let’s build a quick model to illustrate what linea
can do.
We start by importing linea, some other useful
libraries, and some data.
# libraries
library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization
# fictitious ecommerce data
data_path = 'https://raw.githubusercontent.com/paladinic/data/main/ecomm_data.csv'
# importing flat file
data = read_xcsv(file = data_path)
# adding seasonality and Google trends variables
data = data %>%
get_seasonality(date_col_name = 'date',date_type = 'weekly starting') %>%
gt_f(kw = 'prime day',append = T)
# visualize data
data %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
Now let's build a model to understand what drives changes in the
ecommerce variable. We can start by selecting a few initial
independent variables
(i.e. christmas, black.friday, trend, gtrends_prime day).
model = run_model(data = data,
dv = 'ecommerce',
ivs = c('christmas','black.friday','trend','gtrends_prime day'),
id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -20614 -4527 -437 3066 54638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43715.771 949.091 46.061 < 2e-16 ***
## christmas 300.689 26.411 11.385 < 2e-16 ***
## black.friday 320.213 39.079 8.194 1.22e-14 ***
## trend 129.083 6.118 21.098 < 2e-16 ***
## gtrends_prime day 181.853 42.949 4.234 3.20e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7425 on 256 degrees of freedom
## Multiple R-squared: 0.7498, Adjusted R-squared: 0.7459
## F-statistic: 191.8 on 4 and 256 DF, p-value: < 2.2e-16
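As a sanity check on how to read this coefficient table, a fitted value is simply the intercept plus each coefficient multiplied by its variable. A base-R sketch with made-up inputs (illustrative values only, not rows from the actual dataset):

```r
# coefficients taken from the summary above
coefs <- c(intercept         = 43715.771,
           christmas         = 300.689,
           black.friday      = 320.213,
           trend             = 129.083,
           gtrends_prime_day = 181.853)

# hypothetical week: no christmas / black friday effect,
# trend = 100, prime-day search index = 20 (made-up values)
x <- c(1, 0, 0, 100, 20)

fitted_value <- sum(coefs * x)
fitted_value
# 43715.771 + 129.083 * 100 + 181.853 * 20 = 60261.13
```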
Our next steps can be guided by functions like
what_next(), which will test all other variables in our
data. From the output below, it seems like the variables
covid and offline_media would improve the
model most.
model %>%
what_next()
## # A tibble: 81 × 5
## variable adj_R2 t_stat coef adj_R2_diff
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 offline_media 0.836 11.9 6.44 0.121
## 2 covid 0.815 9.79 192. 0.0922
## 3 year_2020 0.814 9.70 12103. 0.0909
## 4 year_2019 0.780 -6.40 -7130. 0.0461
## 5 christmas_eve 0.777 -6.04 -171045. 0.0415
## 6 week_48 0.770 5.30 21478. 0.0326
## 7 christmas_day 0.768 -5.02 -137143. 0.0294
## 8 week_52 0.765 -4.69 -21249. 0.0259
## 9 promo 0.758 3.67 5.62 0.0159
## 10 year_2017 0.753 2.92 3685. 0.00976
## # … with 71 more rows
Adding these variables to the model brings the adjusted R-squared to ~88%.
model = run_model(data = data,
dv = 'ecommerce',
ivs = c('christmas','black.friday','trend','gtrends_prime day','covid','offline_media'),
id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -21555 -2928 -663 2638 16268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.785e+04 7.251e+02 65.993 < 2e-16 ***
## christmas 2.811e+02 1.851e+01 15.184 < 2e-16 ***
## black.friday 2.666e+02 2.774e+01 9.610 < 2e-16 ***
## trend 7.912e+01 5.968e+00 13.256 < 2e-16 ***
## gtrends_prime day 1.845e+02 2.977e+01 6.198 2.3e-09 ***
## covid 1.526e+02 1.623e+01 9.401 < 2e-16 ***
## offline_media 5.508e+00 4.758e-01 11.576 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5141 on 254 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.8782
## F-statistic: 313.3 on 6 and 254 DF, p-value: < 2.2e-16
Now that we have a decent model, we can start extracting insights from it, beginning with the contribution of each independent variable over time.
model %>%
decomp_chart()
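Conceptually, a decomposition like this multiplies each coefficient by its variable's value at each date. A base-R sketch with hypothetical values, not linea's actual implementation:

```r
# three illustrative weeks (made-up values, not the real dataset)
trend         <- c(100, 101, 102)
offline_media <- c(0, 500, 1000)

# coefficients from the model summary above
contrib_trend <- 79.12 * trend
contrib_media <- 5.508 * offline_media

# each row is one variable's contribution to ecommerce per week
rbind(trend = contrib_trend, offline_media = contrib_media)
```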
We can also visualize the relationships between our independent and
dependent variables using response curves. From this we can see that,
for example, when offline_media is 10,
ecommerce increases by ~55. To capture non-linear
relationships (i.e. response curves that aren’t straight lines) see the
Advanced Features page.
model %>%
response_curves(x_min = 0)
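The ~55 figure above follows directly from the offline_media coefficient in the model summary, since the relationship in this model is linear:

```r
coef_offline_media <- 5.508  # from the model summary
coef_offline_media * 10      # effect on ecommerce when offline_media is 10: ~55
```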
The Getting Started page
is a good place to start learning how to build linear models with
linea.
The Advanced Features
page shows how to implement the features of linea that
allow users to capture non-linear relationships.
The Additional Features page illustrates all other functions of the library.
LINEA is being continuously maintained and improved with several features and products under development.
A few improvements on the way:
linea::what_combo()
linea::hill_function()